Exploratory Data Analysis of Significant Earthquakes¶

Table of contents

  • Exploratory Data Analysis of Significant Earthquakes
    • Project Setup and Data Loading
    • About Dataset
      • Context
      • Content
      • Columns
        • Basic Earthquake Info
        • Date
        • Location
        • Magnitudes
        • Earthquake Effects
        • Total Earthquake Effects
    • Feature Engineering
      • Data Understanding and Cleaning
        • Basic Earthquake Info
        • Location
        • Date data
        • Magnitudes
        • Earthquake Effects
        • Total Effects
      • Null Values, Row and Column Selection
        • Dropping Columns
        • Analysis of recorded earthquakes by centuries and decades
        • Dropping rows
        • Handling null values
    • Statistics by time
      • Number of earthquakes over time
      • Earthquake characteristics over time
    • Statistics by location
      • Statistics by countries
      • Maps
      • Analysis for specific countries
        • Turkey
        • Serbia
    • Damage, Injuries and Death Analysis
      • Damage analysis
      • Number of deaths Analysis
      • Number of injuries analysis
    • Tsunami Analysis

Project Setup and Data Loading¶

Setup Completed!
Flag Tsunami Year Month Day Focal Depth EQ Primary Mw Magnitude Ms Magnitude Mb Magnitude Ml Magnitude ... Total Effects : Missing Description Total Effects : Injuries Total Effects : Injuries Description Total Effects : Damages in million Dollars Total Effects : Damage Description Total Effects : Houses Destroyed Total Effects : Houses Destroyed Description Total Effects : Houses Damaged Total Effects : Houses Damaged Description Coordinates
ID Earthquake
78 NaN 334 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 41.2, 19.3
84 Tsunami 344 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN SEVERE (~>$5 to $24 million) NaN NaN NaN NaN 40.3, 26.5
9989 Tsunami 346 NaN NaN NaN 6.8 NaN 6.8 NaN NaN ... NaN NaN NaN NaN MODERATE (~$1 to $5 million) NaN Many (~101 to 1000 houses) NaN NaN 41.4, 19.4
110 NaN 438 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 35.5, 25.5
9971 Tsunami 557 NaN NaN NaN 7.0 NaN 7.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.9, 27.6

5 rows × 42 columns

Number of rows: 6208
Number of columns: 42
List of all columns:
['Flag Tsunami', 'Year', 'Month', 'Day', 'Focal Depth', 'EQ Primary', 'Mw Magnitude', 'Ms Magnitude', 'Mb Magnitude', 'Ml Magnitude', 'MFA Magnitude', 'Unknown Magnitude', 'Intensity', 'Country', 'State', 'Location name', 'Region code', 'Earthquake : Deaths', 'Earthquake : Deaths Description', 'Earthquake : Missing', 'Earthquake : Missing Description', 'Earthquake : Injuries', 'Earthquake : Injuries Description', 'Earthquake : Damage (in M$)', 'Earthquake : Damage Description', 'Earthquakes : Houses destroyed', 'Earthquakes : Houses destroyed Description', 'Earthquakes : Houses damaged', 'Earthquakes : Houses damaged Description', 'Total Effects : Deaths', 'Total Effects : Deaths Description', 'Total Effects : Missing', 'Total Effects : Missing Description', 'Total Effects : Injuries', 'Total Effects : Injuries Description', 'Total Effects : Damages in million Dollars', 'Total Effects : Damage Description', 'Total Effects : Houses Destroyed', 'Total Effects : Houses Destroyed Description', 'Total Effects : Houses Damaged', 'Total Effects : Houses Damaged Description', 'Coordinates']

About Dataset¶

Context¶

The Significant Earthquake Database is a global listing of over 6,200 earthquakes from 2150 BC to the present. The dataset can be found on the OpenDatasoft website.

Content¶

A significant earthquake is classified as one that meets at least one of the following criteria: caused deaths, caused moderate damage (approximately 1 million dollars or more), magnitude 7.5 or greater, Modified Mercalli Intensity (MMI) X or greater, or the earthquake generated a tsunami. The database provides information on the date and time of occurrence, latitude and longitude, focal depth, magnitude, maximum MMI intensity, and socio-economic data such as the total number of casualties, injuries, houses destroyed, and houses damaged, and dollar damage estimates. References, political geography, and additional comments are also provided for each earthquake. If the earthquake was associated with a tsunami or volcanic eruption, it is flagged and linked to the related tsunami event or significant volcanic eruption.

Columns¶

Basic Earthquake Info¶

  • Focal Depth - depth of the hypocenter in kilometers
  • EQ Primary - magnitude of the earthquake (primary measured magnitude) from 1 to 10
  • Intensity - Modified Mercalli Intensity from 1 to 12
  • Flag Tsunami - set to Tsunami if the earthquake triggered a tsunami, otherwise null

Date¶

  • Year
  • Month
  • Day

Location¶

  • Coordinates - (latitude, longitude)
  • Country
  • State
  • Location name
  • Region code

Magnitudes¶

The magnitude is a measure of seismic energy. The magnitude scale is logarithmic. An increase of one in magnitude represents a tenfold increase in the recorded wave amplitude. However, the energy release associated with an increase of one in magnitude is not tenfold, but about thirtyfold. For example, approximately 900 times more energy is released in an earthquake of magnitude 7 than in an earthquake of magnitude 5. Each increase in magnitude of one unit is equivalent to an increase of seismic energy of about 1.6 x 10,000,000,000,000 ergs. All magnitudes have valid values between 0 and 10.
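The arithmetic in the paragraph above can be checked with a short sketch. It assumes the commonly used energy relation log10(E) = 1.5 * M + const, which is not stated in the dataset documentation; the function names are illustrative only.

```python
def energy_ratio(m_small: float, m_large: float) -> float:
    """Ratio of released seismic energy between two magnitudes.

    Assumption: log10(E) = 1.5 * M + const, so the constant cancels in the ratio.
    """
    return 10 ** (1.5 * (m_large - m_small))


def amplitude_ratio(m_small: float, m_large: float) -> float:
    """Ratio of recorded wave amplitude (tenfold per magnitude unit)."""
    return 10 ** (m_large - m_small)


# One magnitude unit: roughly a thirtyfold increase in energy.
one_unit = energy_ratio(6.0, 7.0)

# Two magnitude units: the unrounded factor gives 1000; rounding the
# per-unit factor to 30 first gives the "approximately 900" quoted above.
two_units = energy_ratio(5.0, 7.0)
```

Under this assumption the per-unit factor is about 31.6, so the "thirtyfold" and "900 times" figures are rounded values of the same relation.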

  • Mw Magnitude
    The Mw magnitude is based on the moment magnitude scale. Moment is a physical quantity proportional to the slip on the fault times the area of the fault surface that slips; it is related to the total energy released in the EQ. The moment can be estimated from seismograms (and also from geodetic measurements). The moment is then converted into a number similar to other earthquake magnitudes by a standard formula. The result is called the moment magnitude. The moment magnitude provides an estimate of earthquake size that is valid over the complete range of magnitudes, a characteristic that was lacking in other magnitude scales.
  • Ms Magnitude
    The Ms magnitude is the surface-wave magnitude of the earthquake.
  • Mb Magnitude
    The Mb magnitude is the compressional body wave (P-wave) magnitude.
  • Ml Magnitude
    The ML magnitude was the original magnitude relationship defined by Richter and Gutenberg for local earthquakes in 1935. It is based on the maximum amplitude of a seismogram recorded on a Wood-Anderson torsion seismograph. Although these instruments are no longer widely in use, ML values are calculated using modern instrumentation with appropriate adjustments.
  • MFA Magnitude
    The Mfa magnitudes are computed from the felt area, for earthquakes that occurred before seismic instruments were in general use.
  • Unknown Magnitude
    The computational method for the earthquake magnitude was unknown and could not be determined from the published sources.

Earthquake Effects¶

  • Earthquake : Deaths
  • Earthquake : Deaths Description
  • Earthquake : Missing
  • Earthquake : Missing Description
  • Earthquake : Injuries
  • Earthquake : Injuries Description
  • Earthquake : Damage (in M$)
  • Earthquake : Damage Description
  • Earthquakes : Houses destroyed
  • Earthquakes : Houses destroyed Description
  • Earthquakes : Houses damaged
  • Earthquakes : Houses damaged Description

Total Earthquake Effects¶

  • Total Effects : Deaths
  • Total Effects : Deaths Description
  • Total Effects : Missing
  • Total Effects : Missing Description
  • Total Effects : Injuries
  • Total Effects : Injuries Description
  • Total Effects : Damages in million Dollars
  • Total Effects : Damage Description
  • Total Effects : Houses destroyed
  • Total Effects : Houses destroyed Description
  • Total Effects : Houses damaged
  • Total Effects : Houses damaged Description

Feature Engineering¶

Feature engineering will be performed separately for each of these column groups. First we will decide which columns to keep, measure the number of null values, and decide what to do with them. We will also consider whether some features need to be combined for better further analysis.

Data Understanding and Cleaning¶

Basic Earthquake Info¶

ID Earthquake  Focal Depth  EQ Primary  Intensity  Flag Tsunami
10245          26.0         6.9         9.0        Tsunami
10267          39.0         7.1         9.0        NaN
10367          10.0         5.3         NaN        NaN
10430          10.0         3.8         NaN        NaN
10515          10.0         6.6         7.0        NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Focal Depth   3243 non-null   float64
 1   EQ Primary    4416 non-null   float64
 2   Intensity     2826 non-null   float64
 3   Flag Tsunami  1838 non-null   object 
dtypes: float64(3), object(1)
memory usage: 242.5+ KB
       Focal Depth  EQ Primary   Intensity
count  3243.000000  4416.000000  2826.000000
mean   41.064755    6.458084     8.283439
std    70.317966    1.045100     1.825092
min    0.000000     1.600000     2.000000
25%    10.000000    5.700000     7.000000
50%    25.000000    6.500000     8.000000
75%    40.000000    7.300000     10.000000
max    675.000000   9.500000     12.000000
[Figure: distribution of EQ Primary magnitude; y-axis: Frequency]

From the distribution plot we can see that earthquake magnitude looks approximately normally distributed, with a mean of around 6.5.

[Figure: distribution of Focal Depth; y-axis: Frequency]

Here we can see that most earthquakes have a focal depth of less than 100 km. There are also some outliers with a focal depth of more than 200 km.

Tsunami    1838
Name: Flag Tsunami, dtype: int64

We can see that if a tsunami occurred the flag is set to Tsunami, and if not the flag is left null. We will convert this column to a boolean type with values True or False.

False    4370
True     1838
Name: Flag Tsunami, dtype: int64
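A minimal sketch of this conversion, using a toy Series in place of the real column (the values below are made up for illustration and the variable names are assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Flag Tsunami' column: the string 'Tsunami' or NaN.
flags = pd.Series(["Tsunami", np.nan, np.nan, "Tsunami"], name="Flag Tsunami")

# A non-null entry means a tsunami was recorded, so notna() yields the boolean flag.
tsunami_bool = flags.notna()

# Same kind of summary as the value counts printed above.
counts = tsunami_bool.value_counts()
```

Using notna() rather than comparing against the string 'Tsunami' treats any non-null entry as True, which is equivalent here because 'Tsunami' is the only non-null value in the column.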

Location¶

ID Earthquake  Coordinates      Country      State  Location name                         Region code
10245          5.504, 125.066   PHILIPPINES  NaN    PHILIPPINES: SARANGANI                170.0
10267          18.339, -98.68   MEXICO       NaN    MEXICO: MEXICO CITY, MORELOS, PUEBLA  150.0
10367          26.374, 90.165   INDIA        NaN    INDIA: WEST BENGAL                    60.0
10430          20.0, 72.9       INDIA        NaN    INDIA: MAHARASHTRA: PALGHAR           60.0
10515          12.021, 124.123  PHILIPPINES  NaN    PHILIPPINES: MASBATE                  170.0
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Coordinates    6153 non-null   object 
 1   Country        6208 non-null   object 
 2   State          323 non-null    object 
 3   Location name  6207 non-null   object 
 4   Region code    6207 non-null   float64
dtypes: float64(1), object(4)
memory usage: 291.0+ KB

We can see that the columns in this group have few null values, except for State. Let's see which values that column holds and how useful they will be for further analysis.

State value counts:
CA     103
AK      82
HI      15
PR      14
GU      14
NV       9
NY       8
TAS      8
UT       8
VI       7
BC       6
OK       6
WA       6
MT       5
MP       4
MO       3
PA       3
WY       2
AR       2
KY       2
ID       2
MA       2
CO       2
OR       2
TX       1
CT       1
NC       1
AL       1
VA       1
NH       1
IL       1
SC       1
Name: State, dtype: int64

These values appear to be the US state where the earthquake happened. To test this hypothesis, we will check which values the Country column takes when the State field is not null.

USA              267
USA TERRITORY     39
AUSTRALIA          8
CANADA             6
MEXICO             1
BERING SEA         1
GHANA              1
Name: Country, dtype: int64
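The check above amounts to filtering on non-null State values and counting countries. A sketch on made-up rows (the frame below is illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy rows: State is filled only for some countries.
df = pd.DataFrame(
    {
        "Country": ["USA", "USA", "CANADA", "JAPAN", "USA TERRITORY"],
        "State": ["CA", "AK", "BC", np.nan, "PR"],
    }
)

# Keep rows where State is present, then count the countries they belong to.
country_counts = df.loc[df["State"].notna(), "Country"].value_counts()
```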

We can see that the hypothesis is mostly true. Because of that, this column will only be useful when analysing earthquakes in the USA, and it will mostly be ignored in further analysis.

Region code value counts:
10.0       75
15.0      107
20.0        4
30.0     1045
40.0      305
50.0      120
60.0      472
70.0       13
80.0        1
90.0      165
100.0     169
110.0      50
120.0     127
130.0     846
140.0     810
150.0     490
160.0     600
170.0     808
Name: Region code, dtype: int64

Date data¶

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Year    6208 non-null   int64  
 1   Month   5800 non-null   float64
 2   Day     5646 non-null   float64
dtypes: float64(2), int64(1)
memory usage: 194.0 KB

As we can see on the chart above, the dataset contains many more earthquakes from recent years. Of course, this is because we now have more advanced technology to measure earthquakes. This is the main reason to drop some older years from the dataset.

We can conclude that the month does not influence the number of earthquakes, as no general pattern can be extracted from the chart above.

Magnitudes¶

ID Earthquake  Mw Magnitude  Ms Magnitude  Mb Magnitude  Ml Magnitude  MFA Magnitude  Unknown Magnitude
10245          6.9           NaN           NaN           NaN           NaN            NaN
10267          7.1           NaN           NaN           NaN           NaN            NaN
10367          5.3           NaN           NaN           NaN           NaN            NaN
10430          NaN           NaN           NaN           NaN           NaN            3.8
10515          6.6           NaN           NaN           NaN           NaN            NaN
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Mw Magnitude       1334 non-null   float64
 1   Ms Magnitude       2930 non-null   float64
 2   Mb Magnitude       1804 non-null   float64
 3   Ml Magnitude       184 non-null    float64
 4   MFA Magnitude      14 non-null     float64
 5   Unknown Magnitude  777 non-null    float64
dtypes: float64(6)
memory usage: 339.5 KB
       Mw Magnitude  Ms Magnitude  Mb Magnitude  Ml Magnitude  MFA Magnitude  Unknown Magnitude
count  1334.000000   2930.000000   1804.000000   184.000000    14.000000      777.000000
mean   6.513193      6.574198      5.792572      5.395109      6.771429       6.652638
std    0.928359      0.990792      0.724433      1.087850      1.230027       1.007854
min    3.600000      2.100000      2.100000      1.600000      4.300000       3.200000
25%    5.800000      5.800000      5.300000      4.775000      6.225000       6.000000
50%    6.500000      6.600000      5.800000      5.450000      7.050000       6.800000
75%    7.200000      7.300000      6.300000      6.025000      7.475000       7.500000
max    9.500000      9.100000      8.200000      7.700000      8.500000       8.800000

As we can see from the heatmap, correlations between the different magnitudes are very high, and their distributions do not differ much. Let's see which magnitude is usually taken as the EQ Primary measure. Whichever scale is used most, because of the high correlations it will probably be enough to look only at EQ Primary in further analysis.

All EQ Primary measures are taken from magnitudes if they are not null!

So we can claim that EQ Primary is simply taken from one of the other magnitudes.
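One way such a check could be written is sketched below. The rows are toy values, and the approach (exact equality between EQ Primary and at least one scale-specific column) is an assumption about how the comparison was done, not the notebook's original code.

```python
import numpy as np
import pandas as pd

magnitude_cols = [
    "Mw Magnitude", "Ms Magnitude", "Mb Magnitude",
    "Ml Magnitude", "MFA Magnitude", "Unknown Magnitude",
]

# Toy rows; the last one has no EQ Primary and is excluded from the check.
df = pd.DataFrame(
    {
        "EQ Primary":        [6.9, 7.1, 3.8, np.nan],
        "Mw Magnitude":      [6.9, np.nan, np.nan, np.nan],
        "Ms Magnitude":      [6.8, 7.1, np.nan, np.nan],
        "Mb Magnitude":      [np.nan] * 4,
        "Ml Magnitude":      [np.nan] * 4,
        "MFA Magnitude":     [np.nan] * 4,
        "Unknown Magnitude": [np.nan, np.nan, 3.8, np.nan],
    }
)

has_primary = df["EQ Primary"].notna()

# True where a scale-specific magnitude equals EQ Primary (NaN never matches).
matches = df[magnitude_cols].eq(df["EQ Primary"], axis=0)
all_inferred = bool(matches.any(axis=1)[has_primary].all())

# First matching scale per row, in the column order listed above.
source_scale = matches[has_primary].idxmax(axis=1)
usage = source_scale.value_counts()
```

Counting source_scale then shows which magnitude scale is most often used as EQ Primary.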

From the last plot we can see that the Ms, Mw and Unknown magnitudes are used most often. All of them are very highly correlated pairwise (> 0.94), so it is safe enough to consider only EQ Primary in later analysis.

Earthquake Effects¶

Earthquake : Deaths Earthquake : Missing Earthquake : Injuries Earthquake : Damage (in M$) Earthquake : Houses Destroyed Earthquake : Houses Damaged Earthquake : Deaths Description Earthquake : Missing Description Earthquake : Injuries Description Earthquake : Damage Description Earthquake : Houses Destroyed Description Earthquake : Houses Damaged Description
ID Earthquake
10245 NaN NaN 5.0 NaN 1.0 NaN NaN NaN Few (~1 to 50 deaths) LIMITED (roughly corresponding to less than $1... Few (~1 to 50 houses) Few (~1 to 50 houses)
10267 369.0 NaN 6000.0 8000.000 226.0 184000.0 Many (~101 to 1000 deaths) NaN Very Many (~1001 or more deaths) EXTREME (~$25 million or more) Many (~101 to 1000 houses) Very Many (~1001 or more houses)
10367 1.0 NaN NaN NaN NaN NaN Few (~1 to 50 deaths) NaN NaN NaN NaN NaN
10430 1.0 NaN NaN NaN NaN NaN Few (~1 to 50 deaths) NaN Few (~1 to 50 deaths) LIMITED (roughly corresponding to less than $1... NaN Few (~1 to 50 houses)
10515 1.0 NaN 51.0 0.565 51.0 453.0 Few (~1 to 50 deaths) NaN Some (~51 to 100 deaths) LIMITED (roughly corresponding to less than $1... Some (~51 to 100 houses) Many (~101 to 1000 houses)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 12 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Earthquake : Deaths                        2069 non-null   float64
 1   Earthquake : Missing                       21 non-null     float64
 2   Earthquake : Injuries                      1244 non-null   float64
 3   Earthquake : Damage (in M$)                511 non-null    float64
 4   Earthquake : Houses Destroyed              786 non-null    float64
 5   Earthquake : Houses Damaged                490 non-null    float64
 6   Earthquake : Deaths Description            2551 non-null   object 
 7   Earthquake : Missing Description           21 non-null     object 
 8   Earthquake : Injuries Description          1432 non-null   object 
 9   Earthquake : Damage Description            4446 non-null   object 
 10  Earthquake : Houses Destroyed Description  1704 non-null   object 
 11  Earthquake : Houses Damaged Description    940 non-null    object 
dtypes: float64(6), object(6)
memory usage: 759.5+ KB
Earthquake : Deaths Earthquake : Missing Earthquake : Injuries Earthquake : Damage (in M$) Earthquake : Houses Destroyed Earthquake : Houses Damaged
count 2069.000000 21.000000 1244.000000 511.000000 7.860000e+02 4.900000e+02
mean 3748.109715 2182.761905 2173.584405 1252.089894 1.762395e+04 2.513319e+04
std 25333.721982 9463.737100 26271.351391 6733.908952 1.971549e+05 2.496489e+05
min 1.000000 1.000000 1.000000 0.013000 1.000000e+00 1.000000e+00
25% 3.000000 5.000000 10.000000 3.950000 6.425000e+01 9.000000e+01
50% 22.000000 21.000000 40.000000 22.000000 5.060000e+02 6.605000e+02
75% 305.000000 114.000000 200.000000 200.000000 4.000000e+03 3.465250e+03
max 830000.000000 43476.000000 799000.000000 100000.000000 5.360000e+06 5.360000e+06

From these statistics we can see that this group has a lot of null values.
It is also worth noticing that the maximum number of deaths attributed to a single earthquake is 830,000. Let's see which earthquake caused this number of deaths.
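The lookup of the deadliest earthquake can be sketched with idxmax. Only the first row below mirrors values printed in this notebook; the other two rows, including their IDs, are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Year": [1556, 1970, 2001],
        "Country": ["CHINA", "PERU", "INDIA"],
        # Toy values except for the first row.
        "Earthquake : Deaths": [830000.0, np.nan, 120.0],
    },
    index=pd.Index([732, 90001, 90002], name="ID Earthquake"),
)

# idxmax skips NaN and returns the index label (the earthquake ID) of the maximum.
deadliest_id = df["Earthquake : Deaths"].idxmax()
deadliest = df.loc[deadliest_id]
```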

Flag Tsunami                                                               False
Year                                                                        1556
Month                                                                          1
Day                                                                           23
Focal Depth                                                                  NaN
EQ Primary                                                                   8.0
Mw Magnitude                                                                 NaN
Ms Magnitude                                                                 8.0
Mb Magnitude                                                                 NaN
Ml Magnitude                                                                 NaN
MFA Magnitude                                                                NaN
Unknown Magnitude                                                            NaN
Intensity                                                                   11.0
Country                                                                    CHINA
State                                                                        NaN
Location name                                           CHINA:  SHAANXI PROVINCE
Region code                                                                   30
Earthquake : Deaths                                                     830000.0
Earthquake : Deaths Description                 Very Many (~1001 or more deaths)
Earthquake : Missing                                                         NaN
Earthquake : Missing Description                                             NaN
Earthquake : Injuries                                                        NaN
Earthquake : Injuries Description                                            NaN
Earthquake : Damage (in M$)                                                  NaN
Earthquake : Damage Description                   EXTREME (~$25 million or more)
Earthquake : Houses Destroyed                                                NaN
Earthquake : Houses Destroyed Description                                    NaN
Earthquake : Houses Damaged                                                  NaN
Earthquake : Houses Damaged Description                                      NaN
Total Effects : Deaths                                                  830000.0
Total Effects : Deaths Description              Very Many (~1001 or more deaths)
Total Effects : Missing                                                      NaN
Total Effects : Missing Description                                          NaN
Total Effects : Injuries                                                     NaN
Total Effects : Injuries Description                                         NaN
Total Effects : Damages in million Dollars                                   NaN
Total Effects : Damage Description                EXTREME (~$25 million or more)
Total Effects : Houses Destroyed                                             NaN
Total Effects : Houses Destroyed Description                                 NaN
Total Effects : Houses Damaged                                               NaN
Total Effects : Houses Damaged Description                                   NaN
Latitude                                                                    34.5
Longitude                                                                  109.7
Name: 732, dtype: object
Plotting univariate distributions

From these distributions we can see, because of the scale, that all of them have some outliers with very large values. These outliers are probably the earthquakes with the most damage, deaths and other disastrous effects in history. We will investigate them later, but for now it is important to notice them.

From the correlation matrix we can see that the numbers of deaths and missing people are perfectly correlated. Combined with the fact that there are only 21 non-null values for missing people, this column will not be particularly useful.
The numbers of houses damaged and destroyed are also perfectly correlated, so it will probably be enough to retain just one of these columns.
Besides that, the number of injuries is highly correlated with the numbers of damaged and destroyed houses.

We can see that, except for damaged and destroyed houses, earthquakes with more severe effects are less frequent than those with moderate and small effects. In these columns we have two None values whose meaning is not yet clear. These distributions also do not take null values into account (there are a lot of them), so this analysis will be more meaningful when we use only newer data rather than the whole history. One of the open questions is then to determine the meaning of the null values.

Total Effects¶

We can see that the column names for total and earthquake effects are the same. Because of that, the first important question is how much the values differ between corresponding columns.

Total Effects : Deaths Total Effects : Missing Total Effects : Injuries Total Effects : Damage (in M$) Total Effects : Houses Destroyed Total Effects : Houses Damaged Total Effects : Deaths Description Total Effects : Missing Description Total Effects : Injuries Description Total Effects : Damage Description Total Effects : Houses Destroyed Description Total Effects : Houses Damaged Description
ID Earthquake
10245 NaN NaN 7.0 NaN 1.0 NaN NaN NaN Few (~1 to 50 deaths) LIMITED (roughly corresponding to less than $1... Few (~1 to 50 houses) Few (~1 to 50 houses)
10267 369.0 NaN 6000.0 8000.000 226.0 184000.0 Many (~101 to 1000 deaths) NaN Very Many (~1001 or more deaths) EXTREME (~$25 million or more) Many (~101 to 1000 houses) Very Many (~1001 or more houses)
10367 1.0 NaN NaN NaN NaN NaN Few (~1 to 50 deaths) NaN NaN NaN NaN NaN
10430 1.0 NaN NaN NaN NaN NaN Few (~1 to 50 deaths) NaN Few (~1 to 50 deaths) LIMITED (roughly corresponding to less than $1... NaN Few (~1 to 50 houses)
10515 1.0 NaN 51.0 0.565 51.0 453.0 Few (~1 to 50 deaths) NaN Some (~51 to 100 deaths) LIMITED (roughly corresponding to less than $1... Some (~51 to 100 houses) Many (~101 to 1000 houses)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6208 entries, 78 to 10515
Data columns (total 12 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   Total Effects : Deaths                        1702 non-null   float64
 1   Total Effects : Missing                       25 non-null     float64
 2   Total Effects : Injuries                      1259 non-null   float64
 3   Total Effects : Damage (in M$)                456 non-null    float64
 4   Total Effects : Houses Destroyed              817 non-null    float64
 5   Total Effects : Houses Damaged                428 non-null    float64
 6   Total Effects : Deaths Description            2040 non-null   object 
 7   Total Effects : Missing Description           26 non-null     object 
 8   Total Effects : Injuries Description          1441 non-null   object 
 9   Total Effects : Damage Description            3293 non-null   object 
 10  Total Effects : Houses Destroyed Description  1784 non-null   object 
 11  Total Effects : Houses Damaged Description    821 non-null    object 
dtypes: float64(6), object(6)
memory usage: 759.5+ KB
Total Effects : Deaths Total Effects : Missing Total Effects : Injuries Total Effects : Damage (in M$) Total Effects : Houses Destroyed Total Effects : Houses Damaged
count 1702.000000 25.00000 1259.000000 456.000000 8.170000e+02 4.280000e+02
mean 4228.737368 1910.68000 2379.995234 1892.290559 1.819171e+04 5.882836e+04
std 28267.559410 8667.79685 27424.400752 12469.580794 1.950541e+05 1.015323e+06
min 1.000000 1.00000 1.000000 0.010000 1.000000e+00 1.000000e+00
25% 3.000000 5.00000 10.000000 4.460000 6.100000e+01 9.000000e+01
50% 20.000000 21.00000 40.000000 29.000000 5.000000e+02 6.465000e+02
75% 289.500000 138.00000 200.000000 292.500000 3.600000e+03 2.850000e+03
max 830000.000000 43476.00000 799000.000000 220085.456000 5.360000e+06 2.100000e+07

Even visually we can see that this matrix appears symmetric about its antidiagonal. Correlation values between corresponding columns (earthquake and total effects) are:

  • Deaths: 0.98
  • Missing: 1
  • Injuries: 0.96
  • Damage (in M$): 0.59
  • Houses Destroyed: 1
  • Houses Damaged: 0.99

So, except for the Damage (in M$) column, the other pairs of columns are very highly correlated, and there is no point in analysing earthquake and total effects separately for the numerical columns.
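The pairwise correlations listed above can be computed column by column. The sketch below assumes the two groups share the same suffix after the "Earthquake : " and "Total Effects : " prefixes, and uses made-up numbers.

```python
import numpy as np
import pandas as pd

# Toy values; NaN marks entries missing in one of the two groups.
df = pd.DataFrame(
    {
        "Earthquake : Deaths":      [1.0, 10.0, 100.0, np.nan, 5.0],
        "Total Effects : Deaths":   [1.0, 12.0, 110.0, 3.0, np.nan],
        "Earthquake : Injuries":    [5.0, 50.0, np.nan, 20.0, 8.0],
        "Total Effects : Injuries": [7.0, 50.0, 30.0, 20.0, 8.0],
    }
)

pair_corr = {}
for eq_col in df.columns:
    if not eq_col.startswith("Earthquake : "):
        continue
    suffix = eq_col.split(" : ", 1)[1]
    total_col = f"Total Effects : {suffix}"
    if total_col in df.columns:
        # Series.corr uses only the rows where both values are present (Pearson by default).
        pair_corr[suffix] = df[eq_col].corr(df[total_col])
```

Because each correlation is computed only on rows where both columns are non-null, a pair with very few overlapping values (such as Missing) can show a correlation of 1 without being informative.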

Mean values by columns
Earthquake : Deaths -> 3748.109714838086, Total Effects : Deaths -> 4228.737367802585
Earthquake : Missing -> 2182.7619047619046, Total Effects : Missing -> 1910.68
Earthquake : Injuries -> 2173.5844051446948, Total Effects : Injuries -> 2379.9952343129466
Earthquake : Damage (in M$) -> 1252.0898943248533, Total Effects : Damage (in M$) -> 1892.2905592105262
Earthquake : Houses Destroyed -> 17623.946564885497, Total Effects : Houses Destroyed -> 18191.71481028152
Earthquake : Houses Damaged -> 25133.18775510204, Total Effects : Houses Damaged -> 58828.35514018692

So here we can see that for almost all columns (except Missing, which has a lot of null values), total effects have larger mean values, which is expected.

Jaccard similarity for categorical columns
Earthquake : Deaths Description, Total Effects : Deaths Description -> 0.9559331290243338
Earthquake : Missing Description, Total Effects : Missing Description -> 0.9035087719298246
Earthquake : Injuries Description, Total Effects : Injuries Description -> 0.9859108466575434
Earthquake : Damage Description, Total Effects : Damage Description -> 0.9573823255610486
Earthquake : Houses Destroyed Description, Total Effects : Houses Destroyed Description -> 0.9513708535574468
Earthquake : Houses Damaged Description, Total Effects : Houses Damaged Description -> 0.9309467061931035

This confirms that the categorical columns of these two groups agree very closely.
We will retain only one of these two groups for further analysis.
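The exact definition of Jaccard similarity used for the numbers above is not shown. One plausible reading, sketched here as an assumption, treats each column as a set of (row label, category) pairs and ignores nulls; the function name and toy values are illustrative.

```python
import numpy as np
import pandas as pd


def jaccard_similarity(a: pd.Series, b: pd.Series) -> float:
    """Jaccard index of two categorical columns.

    Each column is treated as a set of (row label, value) pairs; nulls are ignored.
    """
    set_a = set(a.dropna().items())
    set_b = set(b.dropna().items())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Toy description columns for the two effect groups.
eq_desc = pd.Series(["Few", "Many", np.nan, "Some"])
total_desc = pd.Series(["Few", "Many", "Few", "Many"])

# 2 shared pairs out of 5 distinct pairs.
similarity = jaccard_similarity(eq_desc, total_desc)
```

With this definition a row lowers the score both when the two categories differ and when only one of the columns has a value.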

Null Values, Row and Column Selection¶

Dropping Columns¶

From the basic info category, the Flag Tsunami and EQ Primary columns will definitely stay in the dataset. Around 50% of the values for focal depth and intensity are null. Because we will also drop a lot of rows later, we will decide what to do with these columns then.
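The share of null values per column can be measured as in this sketch (toy frame, illustrative values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "EQ Primary":  [6.9, 7.1, np.nan, 5.3],
        "Focal Depth": [26.0, np.nan, np.nan, 10.0],
        "Intensity":   [9.0, np.nan, np.nan, np.nan],
    }
)

# isna() gives a boolean frame; the column mean is the fraction of nulls.
null_share = df.isna().mean().sort_values(ascending=False)
```

Applied to the non-null counts printed earlier (3243 and 2826 out of 6208 rows), this gives roughly 48% nulls for Focal Depth and 54% for Intensity.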

For location columns we will drop Region code even though almost all of its values are present (there are only 18 distinct values, and the focus of the analysis will be on countries). The State column will be retained only for the analysis of earthquakes in the USA. Location name will be retained along with the country, latitude and longitude data.
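The final dataset has separate Latitude and Longitude columns, while the raw data has a single Coordinates string. A sketch of one way to split it is given below; the conversion to float is a choice made for this example and is not necessarily what the notebook did.

```python
import numpy as np
import pandas as pd

# Toy Coordinates values in the "latitude, longitude" format shown earlier.
df = pd.DataFrame({"Coordinates": ["41.2, 19.3", "18.339, -98.68", np.nan]})

# Split on the comma; rows without coordinates stay NaN in both parts.
parts = df["Coordinates"].str.split(",", expand=True)
df["Latitude"] = pd.to_numeric(parts[0].str.strip(), errors="coerce")
df["Longitude"] = pd.to_numeric(parts[1].str.strip(), errors="coerce")
df = df.drop(columns="Coordinates")
```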

We can see that the date data for earthquakes are usually present, so these columns will be retained.

For magnitudes we will retain only EQ Primary column, because of reasons given above.

From the chart above we can see that more data are available for earthquake effects than for total effects. Because of that we will retain the earthquake effects columns for further analysis and drop the total effects columns.

Dataset rows: 6208, columns: 24

Analysis of recorded earthquakes by centuries and decades¶

We can see that very few earthquakes were recorded before the 19th century. Because of that, in order to have a more consistent analysis, we will drop from the dataset all earthquakes that happened before that period.
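Counting earthquakes per century and per decade comes down to binning the Year column. A sketch with made-up years, where each bin is labelled by its starting year:

```python
import pandas as pd

# Toy years, including a BC (negative) year as in the dataset.
years = pd.Series([-2150, 334, 1556, 1899, 1900, 1960, 2011, 2020], name="Year")

# Floor division also works for negative years: -2150 // 100 == -22.
# Labels are the first year of each 100-year or 10-year bin, not ordinal centuries.
century_start = years // 100 * 100
decade_start = years // 10 * 10

per_century = century_start.value_counts().sort_index()
per_decade = decade_start[years >= 1800].value_counts().sort_index()
```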

Again we can see a steady increase in recorded earthquakes over the decades from the 19th to the 21st century. Why is that happening? One likely reason is that, with the development of technology, our recordings of different earthquake factors have become more accurate. Let's see which of the 5 criteria (caused deaths, more than 1 million dollars of damage, magnitude 7.5 or greater, intensity 10 or more, generated a tsunami) was most common and how that distribution changes over the decades.
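The five criteria can be turned into boolean flags and summed per decade. The sketch below uses made-up rows and only the numeric columns; whether the notebook also used the description columns to decide the damage and deaths criteria is not known, so treat the thresholds as assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows; values are illustrative only.
df = pd.DataFrame(
    {
        "Year": [1905, 1908, 1995, 2004, 2011],
        "EQ Primary": [7.8, 6.0, 6.9, 9.1, 5.2],
        "Intensity": [10.0, np.nan, 11.0, 9.0, np.nan],
        "Flag Tsunami": [False, False, False, True, False],
        "Earthquake : Deaths": [np.nan, 12.0, 5000.0, 1000.0, np.nan],
        "Earthquake : Damage (in M$)": [np.nan, np.nan, 100000.0, 10000.0, 2.5],
    }
)

# One boolean column per criterion; comparisons with NaN evaluate to False.
conditions = pd.DataFrame(
    {
        "Deaths": df["Earthquake : Deaths"].fillna(0) > 0,
        "Damage >= $1M": df["Earthquake : Damage (in M$)"].fillna(0) >= 1,
        "Magnitude >= 7.5": df["EQ Primary"] >= 7.5,
        "Intensity >= 10": df["Intensity"] >= 10,
        "Tsunami": df["Flag Tsunami"],
    }
)

# Number of earthquakes satisfying each criterion, per decade.
decade = (df["Year"] // 10 * 10).rename("Decade")
per_decade = conditions.groupby(decade).sum()
```

An earthquake can satisfy several criteria at once, so the row sums of per_decade can exceed the number of earthquakes in that decade.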

[Figure: Number of earthquakes per decade grouped by satisfied conditions; x-axis: Decade]

The main takeaway here is that the number of earthquakes with enough damage or deaths to be classified as significant has increased a lot in more recent decades. Magnitude and intensity made more earthquakes significant in some past decades (with a peak around 1900), so they did not contribute to the trend of an increasing number of significant earthquakes. We can also observe that the number of earthquakes qualifying because of a tsunami has not changed much in most decades (from 1850 to now).

With this we can conclude that this dataset contains more earthquakes from the recent past because earthquakes now cause more damage and deaths (at least recorded ones). This can be due to more advanced technology that allows better tracking of material damage and deaths, but also due to more people and cities in earthquake-prone areas.

Dropping rows¶

To get a more consistent analysis, based on the previous analysis we will retain only earthquakes that happened in 1960 or later.
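The row filter can be sketched as a boolean mask on Year. The rows are toy values, and the cut-off is taken as inclusive because the summary statistics further down show a minimum year of 1960.

```python
import pandas as pd

# Toy rows spanning the cut-off year.
df = pd.DataFrame(
    {
        "Year": [1556, 1959, 1960, 1999, 2020],
        "EQ Primary": [8.0, 6.1, 7.0, 7.6, 5.3],
    }
)

# Keep 1960 and later; copy() avoids modifying a view of the original frame later on.
recent = df[df["Year"] >= 1960].copy()
```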

Handling null values¶

Now we will drop some columns that do not have enough non-null values. From the earthquake effects we will drop the missing people, houses damaged, houses destroyed and damage in M$ columns. The descriptions of damage, injuries and deaths can stay in the dataset. State has few values because it is mostly restricted to USA earthquakes, so it will remain in the dataset. We will drop the Intensity column (Mercalli scale) because of its many null values, so EQ Primary will be the only measure of earthquake strength.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2426 entries, 4216 to 10515
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Flag Tsunami                       2426 non-null   bool   
 1   Year                               2426 non-null   int64  
 2   Month                              2426 non-null   Int32  
 3   Day                                2426 non-null   Int32  
 4   Focal Depth                        2352 non-null   float64
 5   EQ Primary                         2395 non-null   float64
 6   Country                            2426 non-null   object 
 7   State                              143 non-null    object 
 8   Location name                      2426 non-null   object 
 9   Earthquake : Deaths                1053 non-null   float64
 10  Earthquake : Deaths Description    1077 non-null   object 
 11  Earthquake : Injuries              1078 non-null   float64
 12  Earthquake : Injuries Description  1186 non-null   object 
 13  Earthquake : Damage Description    1990 non-null   object 
 14  Latitude                           2425 non-null   object 
 15  Longitude                          2425 non-null   object 
dtypes: Int32(2), bool(1), float64(4), int64(1), object(8)
memory usage: 291.4+ KB
      Year        Month    Day       Focal Depth  EQ Primary  Earthquake : Deaths  Earthquake : Injuries
count 2426.000000 2426.0   2426.0    2352.000000  2395.000000 1053.000000          1078.000000
mean  1994.679720 6.470734 15.791838 33.242347    6.118706    1183.420703          2360.817254
std   17.489439   3.420401 8.752634  58.395766    1.033245    13177.577957         28174.601784
min   1960.000000 1.0      1.0       0.000000     1.600000    1.000000             1.000000
25%   1980.000000 4.0      8.0       10.000000    5.400000    2.000000             9.000000
50%   1999.000000 7.0      16.0      21.000000    6.100000    5.000000             36.000000
75%   2009.000000 9.0      23.0      33.000000    6.900000    29.000000            200.000000
max   2020.000000 12.0     31.0      675.000000   9.500000    316000.000000        799000.000000
[Figure: Focal depth distribution; y-axis: Frequency]

Because the focal depth distribution contains some very large values, we will impute missing values with the median of this column instead of the mean, since the median is less sensitive to outliers.

Number of missing values for focal depth: 74
Imputing focal depth with median value: 21.0
Number of missing values for focal depth after imputing: 0
[Figure: EQ Primary distribution; y-axis: Frequency]

The mean (6.12) and median (6.10) of EQ Primary are almost identical, so we can impute its missing values with the mean of this column.

Number of missing values for eq primary: 31
Imputing eq primary with mean value: 6.118705636743215
Number of missing values for eq primary after imputing: 0
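
The two imputation steps can be sketched as follows. This is an illustration on made-up values, not the notebook's own code: the median is used for the skewed Focal Depth column and the mean for EQ Primary, as described in the text.

```python
import numpy as np
import pandas as pd

# Made-up values: one very deep earthquake skews the depth column.
df = pd.DataFrame(
    {
        "Focal Depth": [10.0, 21.0, 33.0, 600.0, np.nan],
        "EQ Primary": [5.0, 6.0, 7.0, np.nan, 6.0],
    }
)

# Median imputation for the skewed column.
depth_median = df["Focal Depth"].median()
df["Focal Depth"] = df["Focal Depth"].fillna(depth_median)

# Mean imputation for the roughly symmetric column.
magnitude_mean = df["EQ Primary"].mean()
df["EQ Primary"] = df["EQ Primary"].fillna(magnitude_mean)

print("Imputed focal depth with median:", depth_median)
print("Imputed EQ Primary with mean:", magnitude_mean)
```

In this toy frame the depth mean would be 166.0 while the median is 27.0, which shows why the median is the safer fill value when a few very deep events are present.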

Now let's see what null values we have left

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2426 entries, 4216 to 10515
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Flag Tsunami                       2426 non-null   bool   
 1   Year                               2426 non-null   int64  
 2   Month                              2426 non-null   Int32  
 3   Day                                2426 non-null   Int32  
 4   Focal Depth                        2426 non-null   float64
 5   EQ Primary                         2426 non-null   float64
 6   Country                            2426 non-null   object 
 7   State                              143 non-null    object 
 8   Location name                      2426 non-null   object 
 9   Earthquake : Deaths                1053 non-null   float64
 10  Earthquake : Deaths Description    1077 non-null   object 
 11  Earthquake : Injuries              1078 non-null   float64
 12  Earthquake : Injuries Description  1186 non-null   object 
 13  Earthquake : Damage Description    1990 non-null   object 
 14  Latitude                           2425 non-null   object 
 15  Longitude                          2425 non-null   object 
dtypes: Int32(2), bool(1), float64(4), int64(1), object(8)
memory usage: 291.4+ KB

We can see that there is still one earthquake that does not have latitude and longitude. Let's see what that earthquake is and decide what to do with it.

Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude
ID Earthquake
7775 True 1978 6 22 21.0 6.118706 ITALY NaN ITALY: ADRIATIC SEA NaN NaN NaN NaN NaN NaN NaN

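There is very little data about this earthquake: its focal depth (21.0) and magnitude (6.12) match the median and mean that we imputed above, so they are most likely filled-in values rather than measurements. We will therefore drop this row.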
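
A short sketch of dropping rows without coordinates, assuming the index is the earthquake ID. The two rows are a made-up stand-in that reuses the IDs and coordinates printed in this notebook.

```python
import numpy as np
import pandas as pd

# Two illustrative rows: one without coordinates, one complete.
df = pd.DataFrame(
    {
        "Country": ["ITALY", "TURKEY"],
        "Latitude": [np.nan, 40.76],
        "Longitude": [np.nan, 29.97],
    },
    index=pd.Index([7775, 5527], name="ID Earthquake"),
)

# Drop every row that lacks a latitude or a longitude.
df = df.dropna(subset=["Latitude", "Longitude"])
print(df)
```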

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2425 entries, 4216 to 10515
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Flag Tsunami                       2425 non-null   bool   
 1   Year                               2425 non-null   int64  
 2   Month                              2425 non-null   Int32  
 3   Day                                2425 non-null   Int32  
 4   Focal Depth                        2425 non-null   float64
 5   EQ Primary                         2425 non-null   float64
 6   Country                            2425 non-null   object 
 7   State                              143 non-null    object 
 8   Location name                      2425 non-null   object 
 9   Earthquake : Deaths                1053 non-null   float64
 10  Earthquake : Deaths Description    1077 non-null   object 
 11  Earthquake : Injuries              1078 non-null   float64
 12  Earthquake : Injuries Description  1186 non-null   object 
 13  Earthquake : Damage Description    1990 non-null   object 
 14  Latitude                           2425 non-null   object 
 15  Longitude                          2425 non-null   object 
dtypes: Int32(2), bool(1), float64(4), int64(1), object(8)
memory usage: 291.3+ KB

For now we will not impute null values for the number of deaths and injuries, because there are too many of them. Maybe later we will be able to provide more context for these values.

Statistics by time¶

In the previous section we already analysed earthquake occurrences by decade and century, but that was done on larger parts of the dataset and for the purpose of data selection.

First we will combine year, month and day into a single datetime column, because that makes some of the later steps easier to implement.

Year Month Day Date
ID Earthquake
4216 1960 2 29 1960-02-29
4221 1960 4 29 1960-04-29
4257 1962 2 14 1962-02-14
4293 1963 5 19 1963-05-19
4313 1964 4 2 1964-04-02
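
A possible way to build such a column, assuming integer Year, Month and Day columns with no missing values. `pd.to_datetime` can assemble dates from a frame whose columns are named year, month and day, so the columns are renamed to lowercase first; the sample rows are the first three shown above.

```python
import pandas as pd

df = pd.DataFrame(
    {"Year": [1960, 1960, 1962], "Month": [2, 4, 2], "Day": [29, 29, 14]},
    index=pd.Index([4216, 4221, 4257], name="ID Earthquake"),
)

# Rename to the lowercase names that to_datetime recognises, then assemble.
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
df["Date"] = pd.to_datetime(parts)

print(df)
```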

Number of earthquakes over time¶

[Figure: Number of earthquakes per year; x-axis: Year]

We can see that starting from around the year 2000 there is a large increase in the number of significant earthquakes. In 2020 there is a big decrease, so we will now investigate the reason for that. It is possible that we do not have complete data for the last year.

[Figure: Number of earthquakes per month in 2020; x-axis: Date]

As we assumed, there is no data after August 2020. We need to keep this in mind and maybe drop this year in some of the later analyses.

[Figure: Number of earthquakes per month in 1960-2019; x-axis: Date]

As noticed previously on the whole dataset, there are no big differences between months, so earthquakes are spread evenly across the year.

[Figure: Number of earthquakes per day of the week, 1960-2020; x-axis: Date]

As with the statistics by month, there is no big difference in the number of earthquakes between days of the week.
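
Counts per month, per weekday and the last month with data in 2020 can all be derived from the datetime column through the `.dt` accessor. A small sketch on made-up dates (not the notebook's data):

```python
import pandas as pd

# Made-up event dates for illustration.
dates = pd.Series(
    pd.to_datetime(
        ["2019-01-07", "2019-01-08", "2019-03-11", "2020-02-03", "2020-08-09"]
    ),
    name="Date",
)

# Number of events per calendar month and per day of the week.
per_month = dates.dt.month.value_counts().sort_index()
per_weekday = dates.dt.day_name().value_counts()

# Last month that has any record in 2020 (a quick completeness check).
months_2020 = dates[dates.dt.year == 2020].dt.month
last_month_2020 = months_2020.max()

print(per_month)
print(per_weekday)
print("Last month with data in 2020:", last_month_2020)
```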

Earthquake characteristics over time¶

Let's see how magnitude values of earthquakes changed over the years.

[Figure: Magnitude statistics by year; x-axis: Year]

For the maximum magnitude we can notice larger fluctuations in the early years, but the values always stayed within the [7.5, 9.5] interval, so there is no obvious trend here, for example toward stronger earthquakes in recent years.

The average magnitude is slightly lower in recent years. That is probably because more earthquakes now qualify as significant through the damage and deaths criteria, so some lower-magnitude earthquakes are also included in the dataset and slightly decrease the mean. This also explains the even more pronounced trend in the minimum values.

The main conclusion is that earthquakes have not become stronger recently compared to past decades, but more of them are categorized as significant.
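
The yearly minimum, mean and maximum discussed here can be obtained with a single groupby. The sketch below uses made-up magnitudes to show the effect described above: adding weaker events to a year lowers its minimum and mean while the maximum stays high.

```python
import pandas as pd

# Made-up magnitudes: the later year also contains weaker events.
df = pd.DataFrame(
    {
        "Year": [1960, 1960, 2010, 2010, 2010],
        "EQ Primary": [9.5, 7.5, 8.8, 7.0, 4.6],
    }
)

# One row per year with the three summary statistics.
stats_by_year = df.groupby("Year")["EQ Primary"].agg(["min", "mean", "max"])
print(stats_by_year)
```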

[Figure: Focal depth statistics by year; x-axis: Year]

We can see that the minimums and means stayed about the same over the years. The maximum values change a lot, but we cannot say that earthquakes now originate at deeper or shallower focal points than in the past.

[Figure: Number of tsunamis per year; x-axis: Year]
[Figure: Percentage of tsunamis per year; x-axis: Year]

From these two charts we can see that the number of tsunamis per year fluctuated a lot (from 2 to 18), but there is no obvious trend toward more or fewer tsunamis in recent years. We can also notice that a smaller percentage of earthquakes generated tsunamis in recent years, and that is again due to the larger number of regular earthquakes (those that did not cause tsunamis).
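
Because Flag Tsunami is a boolean column, its yearly sum gives the number of tsunamis and its yearly mean gives their share. A minimal sketch on made-up rows, showing how the same count can correspond to a smaller percentage when more earthquakes are recorded:

```python
import pandas as pd

# Made-up rows: one tsunami in each year, but more earthquakes in 2005.
df = pd.DataFrame(
    {
        "Year": [1965, 1965, 2005, 2005, 2005, 2005],
        "Flag Tsunami": [True, False, True, False, False, False],
    }
)

grouped = df.groupby("Year")["Flag Tsunami"]
tsunamis_per_year = grouped.sum()            # True counts as 1
tsunami_pct_per_year = grouped.mean() * 100  # share of True values

print(tsunamis_per_year)
print(tsunami_pct_per_year)
```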

Statistics by location¶

  • by countries
  • by regions
  • maps
  • analysis for specific countries (Turkey, Serbia, ...)

Statistics by countries¶

Maps¶

plate lat lon
0 am 30.754 132.824
1 am 30.970 132.965
2 am 31.216 133.197
3 am 31.515 133.500
4 am 31.882 134.042
[Three interactive figures not rendered in this export]

Earthquake occurrence by time

[Interactive figure not rendered in this export]

Analysis for specific countries¶

We will analyse earthquakes in Turkey and Serbia (and Balkan region).

Turkey¶

Number of earthquakes in Turkey:  100
Number of earthquakes per year in Turkey:  1.639344262295082
Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
5795 False 2004 8 4 10.0 5.6 TURKEY NaN TURKEY: BODRUM NaN NaN 15.0 Few (~1 to 50 deaths) NaN 36.833 27.815 2004-08-04
4488 False 1969 4 30 9.0 5.1 TURKEY NaN TURKEY: DEMIRCI, WESTERN ANATOLIA, ISTANBUL NaN NaN NaN NaN MODERATE (~$1 to $5 million) 39.200 28.600 1969-04-30
5547 False 1999 12 3 19.0 5.7 TURKEY NaN TURKEY: GORESKEN, ERZURUM PROVINCE 1.0 Few (~1 to 50 deaths) 6.0 Few (~1 to 50 deaths) MODERATE (~$1 to $5 million) 40.358 42.346 1999-12-03
5767 False 2004 3 25 21.0 5.6 TURKEY NaN TURKEY: ERZURUM 10.0 Few (~1 to 50 deaths) 46.0 Few (~1 to 50 deaths) LIMITED (roughly corresponding to less than $1... 39.930 40.812 2004-03-25
9833 False 2011 5 19 7.0 4.3 TURKEY NaN TURKEY: SIMAV 2.0 Few (~1 to 50 deaths) 125.0 Many (~101 to 1000 deaths) LIMITED (roughly corresponding to less than $1... 39.120 29.040 2011-05-19
[Two interactive figures not rendered in this export]

So in Turkey earthquakes happen in an eastern and a western region. Let's see where Turkey stands in some statistics.

[Interactive figure not rendered in this export]
Turkey is on 37th place in the world (out of 127 countries) by maximum magnitude earthquake.
Turkey is on 6th place in the world (out of 127 countries) by number of earthquakes.
Turkey is on 7th place in the world (out of 127 countries) by number of deaths.

So Turkey did not have earthquakes with very large magnitudes (37th in the world), but it had a lot of deaths and a lot of damage caused by earthquakes. This suggests that the regions where earthquakes happen are probably densely populated and that the infrastructure is not good enough to withstand earthquakes.
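
Country rankings like the ones printed above could be computed roughly as follows. This is a sketch with made-up countries A, B and C; the aggregate names `max_magnitude`, `n_earthquakes` and `total_deaths` are hypothetical, and ties share the best rank (`method="min"`).

```python
import numpy as np
import pandas as pd

# Made-up data for three fictional countries.
df = pd.DataFrame(
    {
        "Country": ["A", "B", "B", "C", "C", "C"],
        "EQ Primary": [9.0, 8.0, 7.0, 7.5, 6.0, 5.5],
        "Earthquake : Deaths": [10.0, 100.0, np.nan, 500.0, 20.0, np.nan],
    }
)

# One row per country with the three statistics being ranked.
per_country = df.groupby("Country").agg(
    max_magnitude=("EQ Primary", "max"),
    n_earthquakes=("EQ Primary", "size"),
    total_deaths=("Earthquake : Deaths", "sum"),
)

# Rank 1 = largest value in each column.
ranks = per_country.rank(ascending=False, method="min").astype(int)
print(ranks)
```

Country C mirrors the pattern described for Turkey: last by maximum magnitude, yet first by number of earthquakes and by deaths.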

Now we will consider how expected the earthquake that happened in 2023 was. Some of its statistics are:

  • 7.8 magnitude
  • 50783 deaths and 107204 injuries
  • latitude: 37.2023, longitude: 37.0635
  • depth: 10 km
Maximum magnitude of earthquake in Turkey (1960 - 2020): 7.6
Maximum magnitude of earthquake in Turkey in whole history: 7.6
[Figure: Distribution of magnitudes in Turkey]

From these statistics we can see that the 2023 earthquake, at magnitude 7.8, is stronger than any earthquake recorded for Turkey in this dataset. Let's explore its location further.

[Interactive figure not rendered in this export]

Although this earthquake occurred near the edge of a tectonic plate, it was extremely strong compared to past earthquakes in that region. On the map of death counts, the eastern region of the country had the most extreme cases, but this earthquake is still farther south than expected.

Maximum number of deaths in Turkey (1960 - 2020): 17118.0

A death toll of more than 50000 is also unexpected compared to this number, but this was the strongest earthquake for Turkey relative to this dataset, it happened in an unexpected place, and it struck near several cities, which can explain the toll. Turkey also had a dozen earthquakes in the extreme category (for deaths) in the past.

Serbia¶

Number of earthquakes in Serbia:  8
Number of earthquakes per year in Serbia:  0.13114754098360656
Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
4989 False 1983 9 10 10.0 5.1 SERBIA NaN BALKANS NW: SERBIA NaN NaN NaN NaN MODERATE (~$1 to $5 million) 43.246 20.859 1983-09-10
5044 False 1984 9 7 13.0 4.7 SERBIA NaN BALKANS NW: SERBIA NaN NaN 2.0 Few (~1 to 50 deaths) MODERATE (~$1 to $5 million) 43.314 20.957 1984-09-07
5505 False 1998 9 29 10.0 5.5 SERBIA NaN BALKANS NW: SERBIA: BELGRADE, LJIG, VALJEVO 1.0 Few (~1 to 50 deaths) 17.0 Few (~1 to 50 deaths) MODERATE (~$1 to $5 million) 44.209 20.080 1998-09-29
10137 False 2015 3 8 20.0 4.4 SERBIA NaN BALKANS NW: SERBIA: KOSJERIC NaN NaN NaN NaN LIMITED (roughly corresponding to less than $1... 44.088 19.861 2015-03-08
5632 False 2002 4 24 10.0 5.7 SERBIA NaN BALKANS NW: KOSOVO; MACEDONIA: N 1.0 Few (~1 to 50 deaths) 60.0 Some (~51 to 100 deaths) LIMITED (roughly corresponding to less than $1... 42.436 21.466 2002-04-24
4800 False 1978 4 13 33.0 5.7 SERBIA NaN BALKANS NW: SERBIA: BRUS NaN NaN NaN NaN MODERATE (~$1 to $5 million) 43.269 20.919 1978-04-13
4877 False 1980 5 18 9.0 5.8 SERBIA NaN BALKANS NW: SERBIA NaN NaN 30.0 Few (~1 to 50 deaths) MODERATE (~$1 to $5 million) 43.294 20.837 1980-05-18
9632 False 2010 11 3 1.0 5.5 SERBIA NaN BALKANS NW: SERBIA: KRALJEVO 2.0 Few (~1 to 50 deaths) 100.0 Some (~51 to 100 deaths) SEVERE (~>$5 to $24 million) 43.760 20.673 2010-11-03

So from this we can conclude that Serbia has one significant earthquake roughly every seven to eight years (8 earthquakes in 61 years). Let's see them on a map.

Map of earthquakes in Serbia with magnitude

[Interactive figure not rendered in this export]

Map of earthquakes in Serbia by damage

[Interactive figure not rendered in this export]

On these maps we can see that the 2010 Kraljevo earthquake caused the most damage. There is also an active region around Valjevo and one on Kopaonik (which did not cause much material damage).

It is also worth noting that the region around Skopje, near the border, can have very strong earthquakes, so those earthquakes can also affect the southern regions of the country.

Let's see where Serbia ranks worldwide for various parameters.

Serbia is on 99th place in the world (out of 127 countries) by maximum magnitude earthquake.
Serbia is on 46th place in the world (out of 127 countries) by number of earthquakes.
Serbia is on 75th place in the world (out of 127 countries) by number of deaths.

So Serbia had a relatively large number of significant earthquakes, but they did not have very large magnitudes, and the country is about average in deaths and material damage caused by earthquakes.

Damage, Injuries and Death Analysis¶

Damage analysis¶

LIMITED (roughly corresponding to less than $1 million)    728
MODERATE (~$1 to $5 million)                               698
Unknown                                                    435
SEVERE (~>$5 to $24 million)                               301
EXTREME (~$25 million or more)                             263
Name: Earthquake : Damage Description, dtype: int64
[Figure: x-axis: Earthquake : Damage Description; y-axis: EQ Primary]
[Figure: x-axis: Earthquake : Damage Description; y-axis: Focal Depth]
[Figure: y-axis: Number of earthquakes]

From this we can see that for most tsunamis we do not have data about damage. Because of that it makes sense to view these two groups separately when analysing damage.

[Figure: Damage statistics by year]

As we can see from the chart, from 2000 to 2010 there was a huge increase in the number of earthquakes with limited damage. That still does not tell us a lot, because damage may simply have been recorded better in recent years. We are more interested in the severe and extreme damage categories, and there we cannot see an obvious trend compared to past decades, so the number of earthquakes with really big damage stayed more or less the same throughout the years.

[Figure: Number of extreme damage earthquakes per country (15 countries with greatest number)]

China has a big lead here (as it does for number of earthquakes). The USA was 5th for number of earthquakes but is second here, and Japan also ranks higher here than for number of earthquakes. Italy shows a really big difference, because it was the 14th country for number of earthquakes but is 4th here. We can conclude that more developed countries suffer greater material damage from the same number of earthquakes, which could be due to many factors. It is probably due to more expensive infrastructure, but also due to more people living in big cities.

Number of deaths Analysis¶

Number of null values for numerical column: 1372
Number of null values for categorical column: 1348

Because both columns have roughly the same number of non-null values, it is better to focus on the numerical data in further analysis. We will also add a No deaths category to the description where there is no data about deaths (this is just an assumption for now).

Properties of number of deaths
count      1053.000000
mean       1183.420703
std       13177.577957
min           1.000000
25%           2.000000
50%           5.000000
75%          29.000000
max      316000.000000
Name: Earthquake : Deaths, dtype: float64

So we can see that the smallest value here is 1. Because this dataset definitely contains some earthquakes without any deaths (they were classified as significant because of other factors), there are probably a lot of earthquakes without casualties among the null values. The hypothesis is that we can treat all missing values as earthquakes without deaths, but we need to test this further.

Besides that, we can see a huge standard deviation, which tells us that there is a lot of variation in the number of deaths caused by earthquakes. There are a few very lethal earthquakes, but also many with no casualties at all.

Number of earthquakes per category
No deaths                           1348
Few (~1 to 50 deaths)                853
Many (~101 to 1000 deaths)            94
Some (~51 to 100 deaths)              69
Very Many (~1001 or more deaths)      60
None                                   1
Name: Earthquake : Deaths Description, dtype: int64

We have one None value here. We will set its value to No deaths.
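
A sketch of both clean-up steps on a made-up description column. It assumes the stray value is the literal string "None" rather than a true missing value; the notebook output does not show which of the two it is.

```python
import numpy as np
import pandas as pd

# Made-up description column with missing values and a stray "None" string.
desc = pd.Series(
    ["Few (~1 to 50 deaths)", np.nan, "None", "Many (~101 to 1000 deaths)", np.nan],
    name="Earthquake : Deaths Description",
)

# Missing descriptions become "No deaths" (the assumption made in the text),
# and the stray "None" label (assumed to be a string) is folded into it.
desc = desc.fillna("No deaths").replace("None", "No deaths")

counts = desc.value_counts()
print(counts)
```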

Earthquakes with most death counts (first 10)
Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
8732 True 2010 1 12 13.0 7.0 HAITI NaN HAITI: PORT-AU-PRINCE 316000.0 Very Many (~1001 or more deaths) 30000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 18.457 -72.533 2010-01-12
4735 False 1976 7 27 23.0 7.5 CHINA NaN CHINA: NE: TANGSHAN 242769.0 Very Many (~1001 or more deaths) 799000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 39.570 117.980 1976-07-27
7843 True 2008 5 12 19.0 7.9 CHINA NaN CHINA: SICHUAN PROVINCE 87652.0 Very Many (~1001 or more deaths) 374171.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 31.002 103.322 2008-05-12
6778 False 2005 10 8 26.0 7.6 PAKISTAN NaN PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA 76213.0 Very Many (~1001 or more deaths) 146599.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 34.539 73.588 2005-10-08
4531 True 1970 5 31 43.0 7.9 PERU NaN PERU: NORTHERN, PISCO, CHICLAYO 66794.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -9.200 -78.800 1970-05-31
5248 True 1990 6 20 19.0 7.3 IRAN NaN IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL 40000.0 Very Many (~1001 or more deaths) 105000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 36.957 49.409 1990-06-20
5751 False 2003 12 26 10.0 6.6 IRAN NaN IRAN: SOUTHEASTERN: BAM, BARAVAT 31000.0 Very Many (~1001 or more deaths) 30000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 28.995 58.311 2003-12-26
4600 False 1972 4 10 11.0 6.9 IRAN NaN IRAN: QIR,KARZIN, JAHROM, FIRUZABAD 30000.0 Very Many (~1001 or more deaths) 1700.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 28.400 52.800 1972-04-10
5184 False 1988 12 7 5.0 6.8 ARMENIA NaN ARMENIA: LENINAKAN, SPITAK, KIROVAKAN 25000.0 Very Many (~1001 or more deaths) NaN NaN EXTREME (~$25 million or more) 40.987 44.185 1988-12-07
4711 True 1976 2 4 5.0 7.5 GUATEMALA NaN GUATEMALA: CHIMALTENANGO, GUATEMALA CITY 23000.0 Very Many (~1001 or more deaths) 76000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 15.324 -89.101 1976-02-04
[Figure: Distribution of number of deaths; y-axis: Frequency]

Here we can spot some outliers (the most devastating earthquakes), while most earthquakes fall into the 0 to 3000 deaths range. Let's create the same plot with a limited interval to better see the distribution for smaller values.

No deaths                           1349
Few (~1 to 50 deaths)                853
Many (~101 to 1000 deaths)            94
Some (~51 to 100 deaths)              69
Very Many (~1001 or more deaths)      60
Name: Earthquake : Deaths Description, dtype: int64

The situation is similar to before: there are very many earthquakes with a very small number of deaths, and the same shape is preserved at larger values.

This is a chart with extremely big peaks. The years with very large numbers are due to one or a few very devastating earthquakes in that year.

[Figure: Relationship between number of deaths and magnitude]

Here we can see that the categories with more deaths required earthquakes of higher magnitude. On the other hand, there are also a lot of earthquakes that had large magnitudes but did not cause deaths (isolated or unpopulated areas, good infrastructure, etc.).

[Figure: Relationship between number of deaths and magnitude - regression]

With this we can confirm that the correlation between number of deaths and magnitude is positive, but not very strong.
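
The strength of such a relationship can be quantified with the Pearson correlation coefficient and a least-squares line. The sketch below uses made-up values, not the notebook's data, and a plain linear fit with `numpy.polyfit`, which may differ from how the regression plot was produced.

```python
import numpy as np
import pandas as pd

# Made-up magnitudes and death counts with a noisy upward relationship.
df = pd.DataFrame(
    {
        "EQ Primary": [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0],
        "Earthquake : Deaths": [3.0, 0.0, 40.0, 5.0, 900.0, 20.0, 300.0],
    }
)

# Pearson correlation between magnitude and deaths.
pearson_r = df["EQ Primary"].corr(df["Earthquake : Deaths"])

# Slope and intercept of the least-squares regression line.
slope, intercept = np.polyfit(df["EQ Primary"], df["Earthquake : Deaths"], deg=1)

print(f"Pearson r: {pearson_r:.2f}, slope: {slope:.1f} deaths per magnitude unit")
```

Because death counts are heavily skewed, a few extreme events can dominate both numbers, so a rank-based measure such as `corr(method="spearman")` is a reasonable cross-check.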

[Interactive figure not rendered in this export]

Here we can see where in this space the earthquakes with the most deaths lie compared to the category with null values. They usually do not occupy the same space, which supports our hypothesis that null values can be treated as earthquakes with no deaths. The null values are much closer to the earthquakes with few deaths. We can also spot two regions of the chart with a higher rate of earthquakes with a large number of deaths, as well as a few somewhat isolated earthquakes with many deaths and very large magnitudes. Based on this analysis we will treat null values for number of deaths as zeros.

[Figure: Number of deaths per country (15 countries with greatest number); x-axis: Country]

It is very interesting to compare this chart to the one for earthquakes with extreme damage. We saw there that developed countries (such as the USA, Italy and Japan) had a lot of damage, but here we can see that they have far fewer deaths. That is probably due to better infrastructure and better earthquake preparedness.

Number of injuries analysis¶

Number of null values for numerical column: 1347
Number of null values for categorical column: 1239

The numbers of available values closely resemble those for deaths. As for deaths, we will assume that null values are earthquakes with no injuries (and try to support that later).

Properties of number of injuries
count      1078.000000
mean       2360.817254
std       28174.601784
min           1.000000
25%           9.000000
50%          36.000000
75%         200.000000
max      799000.000000
Name: Earthquake : Injuries, dtype: float64

Here we have the same observations as for deaths, but the values are generally greater.

Number of earthquakes per category
No injuries                         1239
Few (~1 to 50 deaths)                642
Many (~101 to 1000 deaths)           274
Some (~51 to 100 deaths)             158
Very Many (~1001 or more deaths)     112
Name: Earthquake : Injuries Description, dtype: int64
Earthquakes with most injuries counts (first 10)
Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
4735 False 1976 7 27 23.0 7.5 CHINA NaN CHINA: NE: TANGSHAN 242769.0 Very Many (~1001 or more deaths) 799000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 39.570 117.980 1976-07-27
7843 True 2008 5 12 19.0 7.9 CHINA NaN CHINA: SICHUAN PROVINCE 87652.0 Very Many (~1001 or more deaths) 374171.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 31.002 103.322 2008-05-12
5589 False 2001 1 26 16.0 7.7 INDIA NaN INDIA: GUJARAT: BHUJ, AHMADABAD, RAJOKOT; PA... 20005.0 Very Many (~1001 or more deaths) 166836.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 23.419 70.232 2001-01-26
6778 False 2005 10 8 26.0 7.6 PAKISTAN NaN PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA 76213.0 Very Many (~1001 or more deaths) 146599.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 34.539 73.588 2005-10-08
5248 True 1990 6 20 19.0 7.3 IRAN NaN IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL 40000.0 Very Many (~1001 or more deaths) 105000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 36.957 49.409 1990-06-20
4711 True 1976 2 4 5.0 7.5 GUATEMALA NaN GUATEMALA: CHIMALTENANGO, GUATEMALA CITY 23000.0 Very Many (~1001 or more deaths) 76000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 15.324 -89.101 1976-02-04
4531 True 1970 5 31 43.0 7.9 PERU NaN PERU: NORTHERN, PISCO, CHICLAYO 66794.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -9.200 -78.800 1970-05-31
5527 True 1999 8 17 13.0 7.6 TURKEY NaN TURKEY: ISTANBUL, KOCAELI, SAKARYA 17118.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 40.760 29.970 1999-08-17
7245 False 2006 5 26 13.0 6.3 INDONESIA NaN INDONESIA: JAVA: BANTUL, YOGYAKARTA 5749.0 Very Many (~1001 or more deaths) 38568.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -7.961 110.446 2006-05-26
5399 True 1995 1 16 22.0 6.9 JAPAN NaN JAPAN: SW HONSHU: KOBE, AWAJI-SHIMA, NISHINO... 5502.0 Very Many (~1001 or more deaths) 36896.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 34.583 135.018 1995-01-16

The Haiti earthquake does not appear here, because it is listed with only 30000 injured people. This could easily be a mistake in the dataset, because some other sources reported around 10x more injured people in that earthquake. We will correct that now.

Everything else, compared to deaths, makes sense.

Earthquakes with most injuries counts (first 10)
Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
4735 False 1976 7 27 23.0 7.5 CHINA NaN CHINA: NE: TANGSHAN 242769.0 Very Many (~1001 or more deaths) 799000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 39.570 117.980 1976-07-27
7843 True 2008 5 12 19.0 7.9 CHINA NaN CHINA: SICHUAN PROVINCE 87652.0 Very Many (~1001 or more deaths) 374171.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 31.002 103.322 2008-05-12
8732 True 2010 1 12 13.0 7.0 HAITI NaN HAITI: PORT-AU-PRINCE 316000.0 Very Many (~1001 or more deaths) 300000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 18.457 -72.533 2010-01-12
5589 False 2001 1 26 16.0 7.7 INDIA NaN INDIA: GUJARAT: BHUJ, AHMADABAD, RAJOKOT; PA... 20005.0 Very Many (~1001 or more deaths) 166836.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 23.419 70.232 2001-01-26
6778 False 2005 10 8 26.0 7.6 PAKISTAN NaN PAKISTAN: MUZAFFARABAD, URI, ANANTNAG, BARAMULA 76213.0 Very Many (~1001 or more deaths) 146599.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 34.539 73.588 2005-10-08
5248 True 1990 6 20 19.0 7.3 IRAN NaN IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL 40000.0 Very Many (~1001 or more deaths) 105000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 36.957 49.409 1990-06-20
4711 True 1976 2 4 5.0 7.5 GUATEMALA NaN GUATEMALA: CHIMALTENANGO, GUATEMALA CITY 23000.0 Very Many (~1001 or more deaths) 76000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 15.324 -89.101 1976-02-04
4531 True 1970 5 31 43.0 7.9 PERU NaN PERU: NORTHERN, PISCO, CHICLAYO 66794.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -9.200 -78.800 1970-05-31
5527 True 1999 8 17 13.0 7.6 TURKEY NaN TURKEY: ISTANBUL, KOCAELI, SAKARYA 17118.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 40.760 29.970 1999-08-17
7245 False 2006 5 26 13.0 6.3 INDONESIA NaN INDONESIA: JAVA: BANTUL, YOGYAKARTA 5749.0 Very Many (~1001 or more deaths) 38568.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -7.961 110.446 2006-05-26
[Figure: Distribution of number of injuries; y-axis: Frequency]
No injuries                         1239
Few (~1 to 50 deaths)                642
Many (~101 to 1000 deaths)           274
Some (~51 to 100 deaths)             158
Very Many (~1001 or more deaths)     112
Name: Earthquake : Injuries Description, dtype: int64

All these statistics are very similar to those for deaths. Also notice that the category labels here say deaths instead of injuries, so we will correct that.
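
The label fix can be done with a literal string replacement on the description column. A minimal sketch on three of the labels shown above:

```python
import pandas as pd

injuries_desc = pd.Series(
    ["Few (~1 to 50 deaths)", "Very Many (~1001 or more deaths)", "No injuries"],
    name="Earthquake : Injuries Description",
)

# Literal (non-regex) replacement of the wrong word in every label.
injuries_desc = injuries_desc.str.replace("deaths", "injuries", regex=False)

print(injuries_desc.tolist())
```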

[Figure: Number of injuries from earthquakes per year; x-axis: Year]

This chart has even bigger peaks than the one for deaths, but the overall pattern is the same.

Text(0.5, 1.0, 'Relationship between number of injuries and magnitude')

This is again very similar to the same chart for deaths.


This chart is similar to the one for deaths, but now more earthquakes fall in the region with a large number of injuries.

There also seems to be more overlap between the categories with more injuries and the category with null values.

Deaths null:  1372
Injuries null:  1347
Both null:  1010

As we can see, most null values overlap: 1010 of the 1347 earthquakes with null injuries (about 75%) also have null deaths. So if we treated null values for deaths as zeros, we can do the same for injuries.
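The overlap check above can be sketched as follows. This is an illustration on invented data, assuming the numeric columns are named `Earthquake : Deaths` and `Earthquake : Injuries` as in the tables shown; filling with zeros is the treatment the text describes, not necessarily the exact call the notebook used.

```python
import numpy as np
import pandas as pd

# Toy data (made up); NaN marks a missing report.
df = pd.DataFrame({
    "Earthquake : Deaths":   [10.0, np.nan, np.nan, 3.0, np.nan],
    "Earthquake : Injuries": [25.0, np.nan, 7.0, np.nan, np.nan],
})

deaths_null = df["Earthquake : Deaths"].isna()
injuries_null = df["Earthquake : Injuries"].isna()

n_deaths_null = int(deaths_null.sum())
n_injuries_null = int(injuries_null.sum())
# Rows where both values are missing at the same time.
n_both_null = int((deaths_null & injuries_null).sum())

print("Deaths null: ", n_deaths_null)
print("Injuries null: ", n_injuries_null)
print("Both null: ", n_both_null)

# Treat a missing injury count as "no injuries reported".
injuries_filled = df["Earthquake : Injuries"].fillna(0)
```

Note that filling with zeros lowers any average computed afterwards, so it is worth stating which version of the column a statistic is based on.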

<AxesSubplot: title={'center': 'Number of injuries per country (15 countries with greatest number)'}, xlabel='Country'>

Here China dominates by far, but the conclusions are the same as for deaths.

Tsunami Analysis¶

Number of tsunamis:  572
Number of normal earthquakes:  1853
Tsunami percentage: 23.59%

So roughly every fourth earthquake caused a tsunami.
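A short sketch of how these counts could be produced, assuming `Flag Tsunami` is a boolean column as in the tables shown. The toy DataFrame is invented; only the final check uses the counts reported above.

```python
import pandas as pd

# Toy data (made up): 2 tsunami-flagged events out of 8.
df = pd.DataFrame(
    {"Flag Tsunami": [True, False, False, True, False, False, False, False]}
)

n_tsunami = int(df["Flag Tsunami"].sum())    # True counts as 1
n_normal = int((~df["Flag Tsunami"]).sum())  # everything not flagged
pct = 100 * n_tsunami / len(df)

print("Number of tsunamis: ", n_tsunami)
print("Number of normal earthquakes: ", n_normal)
print(f"Tsunami percentage: {pct:.2f}%")

# Sanity check with the counts reported in the notebook output.
reported_pct = 100 * 572 / (572 + 1853)
print(f"Reported share: {reported_pct:.2f}%")
```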


As expected, tsunamis occur near shorelines, with very high concentrations around Japan and the Indonesian islands.

<AxesSubplot: title={'center': 'Number of tsunamies per year'}, xlabel='Year'>

Looking at the number of tsunamis per year, we cannot see any obvious trend, but the count clearly fluctuates from year to year.

<AxesSubplot: title={'center': 'Countries with most tsunamis (15 countries with greatest number)'}, xlabel='Country'>

Japan has the most tsunamis, followed by Indonesia and Russia. This ranking differs considerably from the one for the total number of earthquakes, because tsunamis affect different regions.

<AxesSubplot: title={'center': 'Countries with most deaths caused by tsunamis (15 countries with greatest number)'}, xlabel='Country'>

Haiti is first because of the big tsunami that happened there. We can also see that Japan is only in 10th place, although it is the country with the most tsunamis. That is probably due to the strong protection against tsunamis and earthquakes in that country. Russia is also only in 14th place, while it was in 3rd place for the number of tsunamis.
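The contrast between the two rankings (number of tsunamis versus deaths caused by them) can be sketched like this. The rows are invented and the column names `Country`, `Flag Tsunami` and `Earthquake : Deaths` are taken from the tables shown; the notebook's own plotting code is not reproduced here.

```python
import pandas as pd

# Toy data (made up) that mimics the pattern described above.
df = pd.DataFrame({
    "Country": ["JAPAN", "JAPAN", "JAPAN", "HAITI", "INDONESIA", "HAITI"],
    "Flag Tsunami": [True, True, True, True, True, False],
    "Earthquake : Deaths": [10.0, 20.0, 5.0, 1000.0, 200.0, 50.0],
})

tsunamis = df[df["Flag Tsunami"]]

# Ranking 1: how many tsunami-flagged events each country has.
tsunami_counts = tsunamis["Country"].value_counts().head(15)

# Ranking 2: total deaths from tsunami-flagged events per country.
tsunami_deaths = (
    tsunamis.groupby("Country")["Earthquake : Deaths"].sum().nlargest(15)
)

print(tsunami_counts)
print(tsunami_deaths)
```

In this toy example the country with the most events is not the one with the most deaths, which is the same effect seen for Japan and Haiti in the real data.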

Average deaths from tsunamis:  3688.497005988024
Average deaths from normal earthquakes:  711.244920993228
Max deaths from tsunamis:  316000.0
Max deaths from normal earthquakes:  242769.0

So on average tsunamis caused more deaths than normal earthquakes, and the maximum value also belongs to a tsunami (316,000 deaths versus 242,769 for normal earthquakes).
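A hedged sketch of this comparison using a single `groupby`, on invented data. It assumes missing death counts are left as NaN (pandas skips them in `mean` and `max`); if the notebook filled them with zeros first, the averages would be lower. The median is added here only as an illustration, because a mean is easily dominated by a few extreme events.

```python
import numpy as np
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "Flag Tsunami": [True, True, False, False, False],
    "Earthquake : Deaths": [100.0, 300.0, 50.0, np.nan, 10.0],
})

# One row per group: False = normal earthquake, True = tsunami.
summary = (
    df.groupby("Flag Tsunami")["Earthquake : Deaths"]
      .agg(["mean", "max", "median"])
)

print(summary)
```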

<AxesSubplot: title={'center': 'Number of deaths distribution for tsunamies'}, ylabel='Frequency'>

This still looks similar to the distribution for normal earthquakes, but it is a little more spread out.

<AxesSubplot: xlabel='Year'>

In recent years (around 2005-2010) tsunamis caused more deaths than normal earthquakes. Let's find out which tsunamis were responsible.
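One way to build such a per-year comparison and then list the deadliest tsunami events is sketched below. The data is invented, and the column labels "Tsunami" and "Normal earthquake" are names chosen for this illustration only.

```python
import pandas as pd

# Toy data (made up).
df = pd.DataFrame({
    "Year": [2004, 2004, 2005, 2010, 2010],
    "Flag Tsunami": [True, False, False, True, False],
    "Earthquake : Deaths": [1000.0, 20.0, 500.0, 3000.0, 40.0],
})

# Deaths per year, one column per event type; years without a
# given type get 0 instead of NaN.
yearly = (
    df.groupby(["Year", "Flag Tsunami"])["Earthquake : Deaths"]
      .sum()
      .unstack(fill_value=0)
      .rename(columns={True: "Tsunami", False: "Normal earthquake"})
)
print(yearly)

# The tsunami-flagged events with the most deaths.
top_tsunamis = df[df["Flag Tsunami"]].nlargest(10, "Earthquake : Deaths")
print(top_tsunamis)
```

Calling `yearly.plot()` on the result would draw one line per event type with `Year` on the x-axis.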

Flag Tsunami Year Month Day Focal Depth EQ Primary Country State Location name Earthquake : Deaths Earthquake : Deaths Description Earthquake : Injuries Earthquake : Injuries Description Earthquake : Damage Description Latitude Longitude Date
ID Earthquake
8732 True 2010 1 12 13.0 7.0 HAITI NaN HAITI: PORT-AU-PRINCE 316000.0 Very Many (~1001 or more deaths) 30000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 18.457 -72.533 2010-01-12
7843 True 2008 5 12 19.0 7.9 CHINA NaN CHINA: SICHUAN PROVINCE 87652.0 Very Many (~1001 or more deaths) 374171.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 31.002 103.322 2008-05-12
4531 True 1970 5 31 43.0 7.9 PERU NaN PERU: NORTHERN, PISCO, CHICLAYO 66794.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) -9.200 -78.800 1970-05-31
5248 True 1990 6 20 19.0 7.3 IRAN NaN IRAN: RASHT, QAZVIN, ZANJAN, RUDBAR, MANJIL 40000.0 Very Many (~1001 or more deaths) 105000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 36.957 49.409 1990-06-20
4711 True 1976 2 4 5.0 7.5 GUATEMALA NaN GUATEMALA: CHIMALTENANGO, GUATEMALA CITY 23000.0 Very Many (~1001 or more deaths) 76000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 15.324 -89.101 1976-02-04
5527 True 1999 8 17 13.0 7.6 TURKEY NaN TURKEY: ISTANBUL, KOCAELI, SAKARYA 17118.0 Very Many (~1001 or more deaths) 50000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 40.760 29.970 1999-08-17
4216 True 1960 2 29 33.0 5.9 MOROCCO NaN MOROCCO: AGADIR 13100.0 Very Many (~1001 or more deaths) 25000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 30.450 -9.620 1960-02-29
5076 True 1985 9 19 28.0 8.1 MEXICO NaN MEXICO: MICHOACAN: MEXICO CITY 9500.0 Very Many (~1001 or more deaths) 30000.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 18.190 -102.533 1985-09-19
5399 True 1995 1 16 22.0 6.9 JAPAN NaN JAPAN: SW HONSHU: KOBE, AWAJI-SHIMA, NISHINO... 5502.0 Very Many (~1001 or more deaths) 36896.0 Very Many (~1001 or more deaths) EXTREME (~$25 million or more) 34.583 135.018 1995-01-16
4671 True 1974 12 28 22.0 6.2 PAKISTAN NaN PAKISTAN: BALAKOT, PATAN 5300.0 Very Many (~1001 or more deaths) 17000.0 Very Many (~1001 or more deaths) MODERATE (~$1 to $5 million) 35.100 72.900 1974-12-28

Now we can see that the Haiti earthquake was indeed flagged as a tsunami, which contributed to these values.

Tsunamies with extreme damage:  63
Normal earthquakes with extreme damage:  200
damage percentage:  31.5

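So a tsunami is only slightly more likely to cause extreme damage than a normal earthquake. Note that 31.5% is the ratio of tsunamis to normal earthquakes among extreme-damage events (63 / 200), so it should be compared with the same ratio over all events (572 / 1853 ≈ 30.9%), not with the 23.59% share of tsunamis.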
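The comparison can be checked with a few lines of arithmetic on the counts printed above. This is a worked example rather than the notebook's own code; the variable names are chosen here for illustration.

```python
# Counts taken from the notebook outputs above.
tsunami_total, normal_total = 572, 1853
tsunami_extreme, normal_extreme = 63, 200

# Ratio of tsunamis to normal earthquakes, among extreme-damage
# events and among all events.
ratio_extreme = 100 * tsunami_extreme / normal_extreme
ratio_all = 100 * tsunami_total / normal_total

# Probability of extreme damage within each group.
rate_tsunami = 100 * tsunami_extreme / tsunami_total
rate_normal = 100 * normal_extreme / normal_total

print(f"Tsunami/normal ratio, extreme damage: {ratio_extreme:.2f}%")
print(f"Tsunami/normal ratio, all events:     {ratio_all:.2f}%")
print(f"Extreme-damage rate, tsunamis:        {rate_tsunami:.2f}%")
print(f"Extreme-damage rate, normal:          {rate_normal:.2f}%")
```

The per-group rates are the most direct way to read the result: about 11.0% of tsunamis versus about 10.8% of normal earthquakes are recorded with extreme damage, so the difference is small.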